Chromosome level genome assembly of endangered medicinal plant Anisodus tanguticus

Anisodus tanguticus is a medicinal herb that belongs to the Anisodus genus of the Solanaceae family. This endangered herb is mainly distributed on the Qinghai–Tibet Plateau. In this study, we combined Illumina short-read, Nanopore long-read and high-throughput chromosome conformation capture (Hi-C) sequencing technologies to de novo assemble the A. tanguticus genome. A high-quality chromosomal-level genome assembly was obtained with a genome size of 1.26 Gb and a contig N50 of 25.07 Mb. Of the draft genome sequences, 97.47% were anchored to 24 pseudochromosomes with a scaffold N50 of 51.28 Mb. In addition, 842.14 Mb of transposable elements, occupying 66.70% of the genome assembly, were identified and 44,252 protein-coding genes were predicted. The genome assembly of A. tanguticus will provide a genetic repertoire for understanding the adaptation strategy of Anisodus species on the plateau, which will further promote the conservation of endangered A. tanguticus resources.


Background & Summary
The perennial medicinal herb Anisodus tanguticus is a member of the Anisodus genus and is distributed on the Qinghai-Tibet Plateau. A. tanguticus is known as "Tang Chun Na Bao" in traditional Tibetan medicine 1. Its roots have been used by local Tibetan healers to treat septic shock, ulcers, colitis and spasms, and to reduce pain 1,2. The main active components of A. tanguticus roots are tropane alkaloids, such as hyoscyamine, anisodamine, and scopolamine 3. These tropane alkaloids are competitive, reversible antagonists of muscarinic acetylcholine receptors, and are used clinically for the treatment of motion sickness, spasticity, obstetrical analgesia, septic shock, organophosphate poisoning, Parkinson's symptoms, etc. 2,4. Besides, atropine (racemic hyoscyamine) is included in the World Health Organization Model List of Essential Medicines, which lists the most efficacious, safe, and cost-effective medicines for priority conditions (https://www.who.int/publications/i/item/WHO-MHP-HPS-EML-2021.02). In addition to the well-known tropane alkaloids, numerous terpenoids, indolizidine- and pyrrolidine-type alkaloids and cinnamoylphenethylamides with pharmacological activity have been isolated from A. tanguticus [5][6][7][8]. Due to its important medicinal value, A. tanguticus has been massively exploited and collected, resulting in the depletion of its wild resources.
The Anisodus genus comprises four species and three varieties; the four species are A. tanguticus, A. luridus, A. acutangulus, and A. mairei 9. These species are mainly distributed on plateaus (chiefly the Qinghai-Tibet Plateau) at altitudes ranging from 2,680 to 4,200 m, and A. tanguticus has been observed to survive at higher altitudes than A. acutangulus 9. Although the genome of A. acutangulus has been assembled to explore the evolution of tropane alkaloid biosynthesis 10, little is known about the adaptation strategy that Anisodus species use to cope with adverse environments, such as complex land conditions or a variable climate. Recently, the chloroplast genome of A. tanguticus was sequenced to study the adaptation strategy of A. tanguticus on the Qinghai-Tibet Plateau 11,12. However, the chloroplast genome accounts for only a small part of the total genetic information of A. tanguticus; most of it resides in the chromosomal DNA. Thus, a high-quality chromosomal-level genome is necessary to provide the genetic information needed to understand the evolutionary process of the Anisodus genus and the adaptation strategy of Anisodus species on the plateau, which will also promote the conservation of endangered A. tanguticus resources.
In this paper, we generated a high-quality chromosomal-level genome assembly of A. tanguticus based on Illumina short-read sequencing (182.98 Gb), Nanopore long-read sequencing (128.34 Gb) and Hi-C sequencing (136.90 Gb). The assembled genome, composed of 276 contigs, had a genome size of 1.26 Gb with a contig N50 of 25.07 Mb (Table 1). These contigs were anchored to 24 pseudochromosomes, with an anchoring rate of 97.47% and a scaffold N50 of 51.28 Mb (Table 1, Fig. 1). Of this genome assembly, 66.70% (842.14 Mb) consisted of transposable elements, the major component being long terminal repeats (LTRs), which accounted for 44.51% (Tables 1, 2). Meanwhile, 44,252 protein-coding genes composed the final gene repertoire of A. tanguticus (Table 1). This high-quality genome will provide a genetic basis for understanding the adaptive evolution of A. tanguticus on the plateau.

Sample collection and genomic DNA extraction. The seeds of A. tanguticus were collected from Qilian, Qinghai Province, China, and stored in the Germplasm Bank of Wild Species in Southwest China. A. tanguticus plants were cultivated at the Kunming Institute of Botany of the Chinese Academy of Sciences, Yunnan Province, China. Young leaves from an individual A. tanguticus plant were collected and used for genomic DNA (gDNA) extraction following a modified cetyltrimethylammonium bromide (CTAB) protocol 13. The purity and quality of the extracted gDNA were examined with a NanoPhotometer spectrophotometer (Implen, USA) and by agarose gel electrophoresis. Three different tissue samples (leaf, stem, and root) were collected from an individual cultivated A. tanguticus plant and used for RNA extraction.
Illumina sequencing and genome survey analysis. High-quality gDNA was randomly fragmented by ultrasonic oscillation (Covaris, USA) and used for Illumina short-read sequencing. Sequencing libraries with a 350 bp insert size were constructed according to the TruSeq DNA Sample Preparation Guide (Illumina, USA). These libraries were then sequenced on the Illumina NovaSeq 6000 platform (Illumina, USA) in paired-end 150 bp mode at Benagen Technology Co., Ltd. (Wuhan, China). After removing low-quality reads, the resulting 182.98 Gb of clean data were used for the genome survey of A. tanguticus and for polishing the preliminary assembly.
The 19-mer frequencies were counted with Jellyfish (version 2.2.10) from the clean data and used for genome profiling with GenomeScope (version 2.0) (Fig. 2a) 14,15. As a result, the genome size of A. tanguticus was estimated as 1.35 Gb, which was broadly consistent with the genome size (~1.5 Gb) measured by flow cytometry (Fig. 2b). Meanwhile, the heterozygosity and the repeat content were estimated as 0.37% and 60.0%, respectively.
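As a rough illustration of how a k-mer spectrum yields a genome size estimate, the following sketch divides the total number of k-mers by the depth of the main peak. This is a simplification and not the procedure used in this study: GenomeScope fits a mixture model to the histogram rather than reading off a single peak. The toy histogram, the function names and the error cutoff of 5 are assumptions made for illustration and are not derived from the A. tanguticus data.

```python
def parse_histo(text):
    """Parse 'depth count' lines (the two-column layout written by `jellyfish histo`)."""
    histogram = {}
    for line in text.strip().splitlines():
        depth, count = line.split()
        histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Return (estimated size in bp, peak depth) from a k-mer histogram.

    k-mers seen fewer than `min_depth` times are treated as sequencing
    errors and ignored (hypothetical cutoff, chosen for this toy example).
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers above the error cutoff")
    peak_depth = max(solid, key=solid.get)            # depth of the main peak
    total_kmers = sum(d * c for d, c in solid.items())  # k-mer occurrences
    return total_kmers / peak_depth, peak_depth


# Invented toy histogram: low-depth error k-mers plus one peak at depth 20.
toy_text = """
1 5000
2 800
3 100
18 200
19 600
20 1000
21 600
22 200
"""

size, peak = estimate_genome_size(parse_histo(toy_text))
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

Because repeats and heterozygous k-mers shift the spectrum away from a single clean peak, an estimate of this kind should be read only as a first approximation.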
Nanopore sequencing and draft genome assembly. For Nanopore long-read sequencing, libraries were constructed following the protocol of the SQK-LSK110 Ligation Sequencing Kit (Nanopore, UK). The prepared libraries were loaded on flow cells (R9.4) and sequenced on the Nanopore PromethION platform (Nanopore, UK). After removing low-quality reads, a total of 128.34 Gb of clean data, composed of 8.22 million reads, were obtained. The read N50 was 32.63 kb and the longest Nanopore read was 394.22 kb.
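The read statistics above imply a mean read length and a sequencing depth that can be recovered with simple arithmetic. The sketch below uses only the figures reported in this study (128.34 Gb, 8.22 million reads, and the 1.26 Gb assembly length as the denominator); the variable names are illustrative.

```python
GB = 1e9

clean_bases = 128.34 * GB   # Nanopore clean data reported above
read_count = 8.22e6         # number of clean reads reported above
assembly_size = 1.26 * GB   # assembly length reported in this study

# Mean read length in kb: total bases divided by the number of reads.
mean_read_length_kb = clean_bases / read_count / 1e3

# Approximate depth: total bases divided by the assembly length.
depth = clean_bases / assembly_size

print(f"mean read length = {mean_read_length_kb:.1f} kb")
print(f"approximate depth = {depth:.0f}x")
```

The mean read length (about 15.6 kb) is well below the read N50 of 32.63 kb, as expected, because N50 is weighted by read length whereas the mean is not.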
The preliminary assembly was generated by NextDenovo (https://github.com/Nextomics/NextDenovo) with the 128.34 Gb of clean Nanopore data. Subsequently, Racon (version: 1.4.11) 16 was used to polish the preliminary assembly with Nanopore long reads through two iterations. Pilon (version: 1.23) 17 was used to polish the preliminary assembly with Illumina short reads through two iterations. As a result, the draft genome of A. tanguticus was assembled with a total length of 1.26 Gb, composed of 276 contigs with a contig N50 of 25.07 Mb (Table 1).
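For readers unfamiliar with the N50 statistic quoted for contigs, scaffolds and reads, a minimal sketch of its usual definition follows: the length of the sequence at which the cumulative length, summed from the longest sequence downwards, first reaches half of the total. The function name and the toy lengths are illustrative assumptions, not values from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; N50 is the length of
    the sequence at which the running total first reaches at least half
    of the summed length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one positive length")


# Toy contig lengths (arbitrary units); total = 300, half = 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half of the total
```

Applied to the 276 contig lengths of the draft assembly, the same calculation would be expected to reproduce the reported contig N50; the toy input here serves only to show the mechanics.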
Valid Hi-C interaction pairs were identified by HiCUP (version: 0.8.0) and used to construct the chromosome-scale assembly with ALLHiC (version: 0.9.8) 19,20. Finally, 97.47% of the draft genome sequences (1.23 Gb) were anchored to 24 pseudochromosomes of A. tanguticus, and the final chromosome-scale assembly was composed of 131 scaffolds with a scaffold N50 of 51.28 Mb (Table 1, Fig. 3).

Genome annotation. Repeat sequences were identified by combining homology-based and ab initio predictions. Firstly, RepeatMasker (version: 4.0.9) was used for homology-based prediction of the repeat sequences [i.e. "TE (transposable element) proteins" column in Table 2] in the genome assembly based on the Repbase database 21,22. Secondly, RepeatModeler (version: 1.0.11) was used for ab initio prediction of the repetitive sequences to construct an A. tanguticus-specific repeat library 23. This library was also used to annotate the repeat sequences (i.e. "De novo + Repbase" column in Table 2) of the genome assembly by RepeatMasker (version: 4.0.9) 21. These two sets of repeat sequences were combined to obtain the final repeat sequences (i.e. "Combined TEs" column in Table 2), which accounted for 66.70% of the genome assembly.
Protein-coding genes were predicted by a combination of transcriptome-based, ab initio and homology-based prediction. For transcriptome-based prediction, RNA from three different tissues (leaf, stem, and root) was used for RNA sequencing. StringTie (version: 2.1.4) and TransDecoder (version: 5.1.0, https://github.com/TransDecoder/TransDecoder) were used to predict the transcriptome-based genes 24. GlimmerHMM (version: 3.0.4) and Augustus (version: 3.3.2) were used for the ab initio prediction 25,26. Exonerate (version: 2.4.0) was used for homology-based gene prediction with genes from Solanum lycopersicum (Sly), Capsicum annuum (Can), Nicotiana attenuata (Nat) and Solanum tuberosum (Stu) 27. These predicted genes were integrated into 44,282 genes by MAKER (version: 2.31.10, Table 3) 28. These protein-coding genes were annotated with protein sequence databases, including universal protein (UniProt) 29, protein families database (Pfam) 30, gene ontology (GO) 31, Kyoto encyclopedia of genes and genomes (KEGG) 32, KEGG pathway database, InterProScan database 32, and nonredundant protein sequence (NR, https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins). Of the protein-coding genes, 97.36% (43,112 genes) were annotated by at least one database (Table 4). In addition, 30 predicted genes with an intron shorter than 10 bp were designated as pseudogenes and removed from the gene repertoire of A. tanguticus, which led to a final gene count of 44,252.
The rRNA genes were predicted with an rRNA database and the tRNA genes were predicted by tRNAscan-SE (version: 1.23) 33. The non-coding RNAs were predicted by INFERNAL (version: 1.1.2) based on the Rfam database 34,35. Finally, 2,758 tRNAs, 898 rRNAs, 1,821 snRNAs and 269 miRNAs were identified in A. tanguticus.
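The gene counts reported above are mutually consistent, as the short check below illustrates. It uses only numbers stated in the text (44,282 genes integrated by MAKER, 43,112 annotated genes, 30 removed pseudogenes); the variable names are illustrative.

```python
maker_genes = 44_282        # genes integrated by MAKER
annotated_genes = 43_112    # genes annotated by at least one database
removed_pseudogenes = 30    # genes with an intron shorter than 10 bp

# Share of MAKER genes with at least one functional annotation.
annotated_pct = 100 * annotated_genes / maker_genes

# Final gene repertoire after removing the designated pseudogenes.
final_gene_count = maker_genes - removed_pseudogenes

print(f"annotated: {annotated_pct:.2f}%")
print(f"final gene count: {final_gene_count:,}")
```

The annotated percentage is computed here against the 44,282 MAKER genes, which is the denominator that reproduces the reported 97.36%.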
Based on the constructed phylogenetic tree and the clustered gene families, CAFE (version: 4.2.1) 47 analysis identified 1,820 expanded and 2,537 contracted gene families in the A. tanguticus genome (Fig. 4). Of these, 161 expanded gene families and 42 contracted gene families were statistically significant (Table 5). The 161 significantly expanded gene families were enriched in 38 GO terms, including "DNA metabolic process", "DNA integration" and "mitochondrion" (Table 6), which are probably related to the strong UV radiation and low temperature on the plateau.

Data Records
The A. tanguticus genome project has been deposited in the NCBI database under BioProject accession PRJNA1018692. The genome assembly and gene annotation have been deposited at GenBank under the WGS accession JAVYJV000000000 48. The genomic Illumina sequencing data were deposited in the NCBI SRA under accession SRR26127850 49. The Nanopore sequencing data were deposited in the NCBI SRA under accession SRR26213735 50. The Hi-C sequencing data were deposited in the NCBI SRA under accession SRR26152880 51. The transcriptomic sequencing data were deposited in the NCBI SRA under accessions SRR26156612-SRR26156618 [52][53][54][55][56][57][58].

Technical Validation
Evaluation of the genome assembly. The quality of the A. tanguticus genome assembly was evaluated in terms of contiguity, completeness, and correctness. For contiguity, Hi-C interaction analysis showed clear interaction signals for the 24 pseudochromosomes, consistent with the reported chromosome number of A. tanguticus 59. Moreover, 97.47% of the draft genome sequences were ordered and oriented in the 24 pseudochromosomes, with an N50 of 51.28 Mb, suggesting a high contiguity of this genome assembly. For completeness, 97.83% of BUSCO (benchmarking universal single-copy orthologs) genes were retrieved as complete in the A. tanguticus genome assembly by BUSCO (version: 5.2.2) analysis with the embryophyta_odb10 database 60. Additionally, the fragmented and missing BUSCO genes accounted for only 0.25% and 1.92%, respectively. For correctness, all Illumina short reads were mapped to the genome assembly by BWA 61, with a high mapping rate of 99.96%. Overall, the genome assembly showed high contiguity, completeness, and correctness.
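The three BUSCO categories reported above should sum to 100% and should correspond to whole numbers of genes. The sketch below checks this, assuming that the embryophyta_odb10 lineage contains 1,614 BUSCO groups; that total is an assumption not stated in this paper and should be confirmed against the BUSCO summary file. The function and variable names are illustrative.

```python
N_BUSCOS = 1614  # assumed size of embryophyta_odb10 (not stated in the text)

reported_pct = {"complete": 97.83, "fragmented": 0.25, "missing": 1.92}


def implied_counts(percentages, n_total):
    """Convert reported percentages into the nearest whole gene counts."""
    return {name: round(pct * n_total / 100) for name, pct in percentages.items()}


counts = implied_counts(reported_pct, N_BUSCOS)

print(f"sum of percentages = {sum(reported_pct.values()):.2f}")
for name, count in counts.items():
    print(f"{name}: {count}")
print(f"sum of counts = {sum(counts.values())}")
```

Under this assumption, the implied counts add up to the lineage total, so the rounded percentages are internally consistent.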
Evaluation of the gene repertoire. The final gene repertoire of A. tanguticus comprised 44,252 protein-coding genes, while 38,388 or 38,128 protein-coding genes were predicted in the genome of A. acutangulus in previous studies 10,62. Given the phylogenetic proximity of A. tanguticus and A. acutangulus (Fig. 4), we compared the gene repertoires of these two species, focusing on both syntenic and non-syntenic genes. For syntenic genes, 34,447 genes in the A. tanguticus genome corresponded to 33,162 genes in the A. acutangulus genome (Table 7). For non-syntenic genes, 9,805 and 4,966 genes were predicted in the A. tanguticus and A. acutangulus genomes, respectively. The difference between the gene repertoires of these two species mainly stemmed from the non-syntenic genes, which could result from species-specific gene variation or from a more detailed annotation of protein-coding genes in the A. tanguticus genome.
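The statement that the difference mainly stems from non-syntenic genes can be checked directly from the counts above. The sketch below uses only numbers reported in the text; the dictionary layout and variable names are illustrative.

```python
repertoire = {
    "A. tanguticus": {"syntenic": 34_447, "non_syntenic": 9_805},
    "A. acutangulus": {"syntenic": 33_162, "non_syntenic": 4_966},
}

# Total gene count per species (syntenic + non-syntenic).
totals = {species: sum(parts.values()) for species, parts in repertoire.items()}

tan, acu = repertoire["A. tanguticus"], repertoire["A. acutangulus"]
total_diff = totals["A. tanguticus"] - totals["A. acutangulus"]
non_syntenic_diff = tan["non_syntenic"] - acu["non_syntenic"]

# Share of the total difference explained by non-syntenic genes.
non_syntenic_share = 100 * non_syntenic_diff / total_diff

print(totals)
print(f"difference = {total_diff:,}; non-syntenic share = {non_syntenic_share:.1f}%")
```

The A. acutangulus total obtained this way (38,128) matches one of the two previously reported gene counts, and non-syntenic genes account for roughly four fifths of the difference between the two repertoires.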

Fig. 1 The genome assembly and annotation of A. tanguticus. (a) Circular map of A. tanguticus. The 24 outer lines represent 24 pseudochromosomes (Chr1−24). The blue and red bands represent the density of transposable elements and protein-coding genes, respectively. The inner lines represent syntenic blocks in the A. tanguticus assembly. (b) Photograph of A. tanguticus. (c) The process pipeline of A. tanguticus genome assembly and annotation.

Fig. 2 The evaluation of A. tanguticus genome size. (a) GenomeScope profile of the 19-mer analysis. The X-axis represents the k-mer depth and the Y-axis represents the frequency of k-mers at a given depth. (b) The flow cytometry of A. tanguticus. Endopolyploidy was observed in the genome of A. tanguticus.

Table 2. Summary of repeat contents in A. tanguticus.

Table 3. Statistical analysis of the gene structure of the A. tanguticus genome.

Table 4. Statistical analysis of the gene annotations of the A. tanguticus genome.

Fig. 4 The inferred phylogenetic tree of A. tanguticus and nine other species. A. tanguticus and A. acutangulus clustered together.

Table 5. Summary of expanded and contracted gene families among A. tanguticus and nine other species.

Table 6. GO enrichment analysis of the significantly expanded gene families in A. tanguticus.

Table 7. The differences in gene repertoires of A. tanguticus and A. acutangulus.